The goal of midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics available at UCI ML repository as well as in this course website on canvas. The overall goal will be to use data modeling approaches to understand which wine properties influence the most wine quality as determined by expert evaluation. The output variable in this case assigns wine to discrete categories between 0 (the worst) and 10 (the best), so that this problem can be formulated as classification or regression – here we will stick to the latter and treat/model outcome as continuous variable. For more details please see dataset description available at UCI ML or corresponding file in this course website on canvas. Please note that there is another, much smaller, dataset on UCI ML also characterizing wine in terms of its analytical properties – make sure to use correct URL as shown above, or, to eliminate possibility for ambiguity, the data available on the course website in canvas – the correct dataset contains several thousand observations. For simplicity, clarity and to decrease your dependency on the network reliability and UCI ML availability you are advised to download data made available in this course website to your local folder and work with this local copy.
There are two compilations of data available under the URL shown above as well as in the course website in canvas – separate for red and for white wine – please develop models of wine quality for each of them, investigate attributes deemed important for wine quality in both and determine whether quality of red and white wine is influenced predominantly by the same or different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning you will be asked to combine analytical data for red and white wine and describe the structure of the resulting data – whether there are any well defined clusters, what subsets of observations they appear to represent, which attributes seem to affect the most this structure in the data, etc.
Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect that you use what you’ve learned in the class to complete the analysis and draw appropriate conclusions based on the data. All approaches that you are expected to apply here have been exercised in the preceeding weekly assignments – please feel free to consult your submissions and/or official solutions as to how they have applied to different datasets. As always, if something appears to be unclear, please ask questions – we may change to private mode those that in our opinion reveal too many details as we see fit.
Download and read in the data, produce numerical and graphical summaries of the dataset attributes, decide whether they can be used for modeling in untransformed form or any transformations are justified, comment on correlation structure and whether some of the predictors suggest relationship with the outcome.
#Read in the datasets
wine.white <- read.csv("winequality-white.csv", header = TRUE, sep=";")
wine.red <- read.csv("winequality-red.csv", header = TRUE, sep=";")
#Some exploratory analysis to get a feel for the data
head(wine.white)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
dim(wine.white)
## [1] 4898 12
summary(wine.white)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
#It also appears that there are some outliers that we may want to plot (especially for fixed acidity).
#We are also using the numeric summary to confirm that numerics have been read in properly and not as factors or strings
plotType <- function(domain=wine.white,FUN=hist){
FUN = match.fun(FUN)
par(mfrow=c(3,4))
for (i in 1:dim(domain)[2]){
FUN(domain[,i],xlab = names(domain)[i],main = "")
}
}
plotType(domain = wine.white, FUN=plot)
plotType(domain = wine.white, FUN=boxplot)
#From this initial plot, we can see that the following variables have outliers that should be removed: fixed.acidity, citric acid, residual sugar, free sulfur dioxide, total sulfur dioxide and density
par(mfrow=c(1,1))
attach(wine.white)
plot(citric.acid)
plot(citric.acid[1400:1700])
# There is also something strange going on for citric acid between index 1400 -1700. It looks like this may be a data entry error as the values only take on 3 discrete values. lets remove these first
#Removing records with the strange citric acid values.
wwc <- wine.white[c(1:1400,1700:nrow(wine.white)),]
#Next,we will remove the outliers identified
wwc <- subset(wwc,
fixed.acidity < 12 &
citric.acid < 1.5 &
residual.sugar < 30 &
free.sulfur.dioxide < 200 &
total.sulfur.dioxide <400 &
density < 1)
plotType(domain = wwc, FUN=hist)
#From the histogram, we can see that many distributions are right skewed.
pairs(wwc)
#We can also see from the correlation plot, that a few variables are showing signs of heteroskedacity so this will need to be fixed.
#Looking over the previous evidence, and that for the red wine, the 6 variables above were log-transformed. This was kept consistent between both red and white wine as we are combining them later.
wwm.all <- mutate(wwc, fixed.acidity=log(fixed.acidity),
volatile.acidity=log(volatile.acidity),
chlorides=log(chlorides),
free.sulfur.dioxide=log(free.sulfur.dioxide),
total.sulfur.dioxide=log(total.sulfur.dioxide),
sulphates=log(sulphates)
)
# The strength of the correlation is not obvious from the pairs plot, and the standard output from cor() is not so easy to interpret, so using the psych library to display it all together
psych::pairs.panels(wwm.all)
cor.wm <- as.data.frame(as.table(cor(wwm.all)))
subset(cor.wm, abs(Freq) > 0.5 & Var1 != Var2) %>% arrange(-abs(Freq))
## Var1 Var2 Freq
## 1 density residual.sugar 0.8185135
## 2 residual.sugar density 0.8185135
## 3 alcohol density -0.8096686
## 4 density alcohol -0.8096686
## 5 total.sulfur.dioxide free.sulfur.dioxide 0.6243473
## 6 free.sulfur.dioxide total.sulfur.dioxide 0.6243473
## 7 density total.sulfur.dioxide 0.5095204
## 8 total.sulfur.dioxide density 0.5095204
## 9 alcohol chlorides -0.5054419
## 10 chlorides alcohol -0.5054419
#Using a cutoff of .5 to show highly correlated variables.
#We can see that the highest correlation is between density and residual sugar, then alcohol and density. Then alcholol with chlorides
# Finally, going to create a test and train set
#smp.size <- floor(0.8 * nrow(wwm.all))
## set the seed to make your partition reproductible
#set.seed(123)
#train.ind <- sample(seq_len(nrow(wwm.all)), size = smp.size)
#wwm <- wwm.all[train.ind, ]
#wwm.test <- wwm.all[-train.ind, ]
wwm <- wwm.all
###############################################
# Repeating above analysis for red wine
head(wine.red)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
dim(wine.red)
## [1] 1599 12
summary(wine.red)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
# Checking for outliers
plotType(domain = wine.red, FUN=plot)
plotType(domain = wine.red, FUN=boxplot)
par(mfrow=c(1,1))
attach(wine.red)
## The following objects are masked from wine.white:
##
## alcohol, chlorides, citric.acid, density, fixed.acidity,
## free.sulfur.dioxide, pH, quality, residual.sugar, sulphates,
## total.sulfur.dioxide, volatile.acidity
plot(citric.acid)
plot(citric.acid[500:600])
# There is a similar issue one indexs 5-600. Removing these records
rwc <- wine.red[c(1:1400,1700:nrow(wine.red)),]
# removing outliers
rwc <- subset(rwc,
volatile.acidity < 1.5 &
citric.acid < 1 &
residual.sugar < 15 &
total.sulfur.dioxide < 250 &
chlorides < .6 &
sulphates < 1.8)
#Checking distribution
plotType(domain = rwc, FUN=hist)
pairs(rwc)
#applying same transformations as was done with red wine.
rwm.all <- mutate(rwc, fixed.acidity=log(fixed.acidity),
volatile.acidity=log(volatile.acidity),
chlorides=log(chlorides),
free.sulfur.dioxide=log(free.sulfur.dioxide),
total.sulfur.dioxide=log(total.sulfur.dioxide),
sulphates=log(sulphates)
)
psych::pairs.panels(rwm.all)
cor.rm <- as.data.frame(as.table(cor(rwm.all)))
subset(cor.rm, abs(Freq) > 0.5 & Var1 != Var2) %>% arrange(-abs(Freq))
## Var1 Var2 Freq
## 1 total.sulfur.dioxide free.sulfur.dioxide 0.7868810
## 2 free.sulfur.dioxide total.sulfur.dioxide 0.7868810
## 3 pH fixed.acidity -0.7027256
## 4 fixed.acidity pH -0.7027256
## 5 citric.acid fixed.acidity 0.6852071
## 6 fixed.acidity citric.acid 0.6852071
## 7 density fixed.acidity 0.6700530
## 8 fixed.acidity density 0.6700530
## 9 citric.acid volatile.acidity -0.5538977
## 10 volatile.acidity citric.acid -0.5538977
## 11 pH citric.acid -0.5402992
## 12 citric.acid pH -0.5402992
# The most highly correlated variables we are seeing here are not the same as with the red wine. The highest correlated are total.sulfur.dioxide with free.sulfur dioxide. For both red and white wine, alcohol is the highest correlated variable with the quality
#smp.size <- floor(0.8 * nrow(rwm.all))
## set the seed to make your partition reproductible
#set.seed(123)
#train.ind <- sample(seq_len(nrow(rwm.all)), size = smp.size)
#rwm <- rwm.all[train.ind, ]
#rwm.test <- rwm.all[-train.ind, ]
rwm <- rwm.all
Use regsubsets from library leaps to choose optimal set of variables for modeling wine quality for red and white wine (separately), describe differences and similarities between attributes deemed important in each case.
regSubAllMeths <- function(dat){
summary.metrics <- NULL
which.all <- list()
rs.all <- list()
for ( my.mthd in c("exhaustive", "backward", "forward", "seqrep") ) {
rs.res <- regsubsets(quality~.,dat,method=my.mthd,nvmax=11)
summ.res <- summary(rs.res)
which.all[[my.mthd]] <- summ.res$which
#rsAll [[my.mthd]] <- rsRes
for ( metric.name in c("rsq","rss","adjr2","cp","bic") ) {
summary.metrics <- rbind(summary.metrics,
data.frame(method=my.mthd,metric=metric.name,
nvars=1:length(summ.res[[metric.name]]),
value=summ.res[[metric.name]]))
}
}
list(summary.metrics, which.all)
}
res.all.white <- regSubAllMeths(wwm)
sum.met.white <- data.frame(res.all.white[1])
which.white <- res.all.white[2][[1]]
# Plot of the number of variables against each metric score by method. To help us decide the optimal number of variables
ggplot(sum.met.white,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
#Retrieve "best" number of subsets for each method/metric
result.white <- sum.met.white %>%
group_by(method,metric) %>%
filter((value == max(value) & metric %in% c("rsq","adjr2"))
| (value == min(value) & metric %in% c("rss","cp","bic")))
result.white
## Source: local data frame [20 x 4]
## Groups: method, metric [20]
##
## method metric nvars value
## <fctr> <fctr> <int> <dbl>
## 1 exhaustive rsq 11 0.3093220
## 2 exhaustive rss 11 2428.2856334
## 3 exhaustive adjr2 9 0.3079285
## 4 exhaustive cp 8 7.2594258
## 5 exhaustive bic 8 -1593.9190342
## 6 backward rsq 11 0.3093220
## 7 backward rss 11 2428.2856334
## 8 backward adjr2 9 0.3079285
## 9 backward cp 8 7.2594258
## 10 backward bic 8 -1593.9190342
## 11 forward rsq 11 0.3093220
## 12 forward rss 11 2428.2856334
## 13 forward adjr2 9 0.3079285
## 14 forward cp 8 7.2594258
## 15 forward bic 8 -1593.9190342
## 16 seqrep rsq 11 0.3093220
## 17 seqrep rss 11 2428.2856334
## 18 seqrep adjr2 9 0.3079285
## 19 seqrep cp 9 8.0890750
## 20 seqrep bic 7 -1592.3390335
#From the graph and the summary, we can see that adjusted Rsqare favours 9, cp and BIC favor 8 variables across most models. From the graph we can see that there is a negligible change in RSS between 7 and 8 variables, and in fact we could probably justify using 6 variables.
#Number of variables to select
parmn <- 9
regsub.white <- data.frame(exhaustive = which.white$exhaustive[parmn,],
backward = which.white$backward[parmn,],
forward = which.white$forward[parmn,],
seqrep = which.white$seqrep[parmn,])
kable(regsub.white)
| exhaustive | backward | forward | seqrep | |
|---|---|---|---|---|
| (Intercept) | TRUE | TRUE | TRUE | TRUE |
| fixed.acidity | TRUE | TRUE | TRUE | TRUE |
| volatile.acidity | TRUE | TRUE | TRUE | TRUE |
| citric.acid | FALSE | FALSE | FALSE | FALSE |
| residual.sugar | TRUE | TRUE | TRUE | TRUE |
| chlorides | TRUE | TRUE | TRUE | TRUE |
| free.sulfur.dioxide | TRUE | TRUE | TRUE | TRUE |
| total.sulfur.dioxide | FALSE | FALSE | FALSE | FALSE |
| density | TRUE | TRUE | TRUE | TRUE |
| pH | TRUE | TRUE | TRUE | TRUE |
| sulphates | TRUE | TRUE | TRUE | TRUE |
| alcohol | TRUE | TRUE | TRUE | TRUE |
#With 9 variables, all methods produced models using the same variables
w.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+chlorides+free.sulfur.dioxide+density+pH+sulphates+alcohol,data=wwm)
cat("White wine: MSE of ",cv.glm(wwm, w.glm, K=10)$delta[1])
## White wine: MSE of 0.5407825
########################################################
# Red wine section
# Repeating above analysis for red whine
res.all.red <- regSubAllMeths(rwm)
sum.met.red <- data.frame(res.all.red[1])
which.red <- res.all.red[2][[1]]
ggplot(sum.met.red,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
sum.met.red %>%
group_by(method,metric) %>%
filter((value == max(value) & metric %in% c("rsq","adjr2"))
| (value == min(value) & metric %in% c("rss","cp","bic")))
## Source: local data frame [20 x 4]
## Groups: method, metric [20]
##
## method metric nvars value
## <fctr> <fctr> <int> <dbl>
## 1 exhaustive rsq 11 0.3764951
## 2 exhaustive rss 11 555.7077014
## 3 exhaustive adjr2 9 0.3723567
## 4 exhaustive cp 8 7.7392822
## 5 exhaustive bic 7 -595.8945564
## 6 backward rsq 11 0.3764951
## 7 backward rss 11 555.7077014
## 8 backward adjr2 9 0.3723567
## 9 backward cp 8 7.7392822
## 10 backward bic 7 -595.8945564
## 11 forward rsq 11 0.3764951
## 12 forward rss 11 555.7077014
## 13 forward adjr2 9 0.3723567
## 14 forward cp 8 7.7392822
## 15 forward bic 7 -595.8945564
## 16 seqrep rsq 11 0.3764951
## 17 seqrep rss 11 555.7077014
## 18 seqrep adjr2 8 0.3720980
## 19 seqrep cp 8 7.7392822
## 20 seqrep bic 7 -595.8945564
parmn <- 9
regsub.red <- data.frame(exhaustive = which.red$exhaustive[parmn,],
backward = which.red$backward[parmn,],
forward = which.red$forward[parmn,],
seqrep = which.red$seqrep[parmn,])
kable(regsub.red)
| exhaustive | backward | forward | seqrep | |
|---|---|---|---|---|
| (Intercept) | TRUE | TRUE | TRUE | TRUE |
| fixed.acidity | TRUE | TRUE | TRUE | TRUE |
| volatile.acidity | TRUE | TRUE | TRUE | TRUE |
| citric.acid | TRUE | TRUE | TRUE | TRUE |
| residual.sugar | FALSE | FALSE | FALSE | TRUE |
| chlorides | TRUE | TRUE | TRUE | TRUE |
| free.sulfur.dioxide | TRUE | TRUE | TRUE | TRUE |
| total.sulfur.dioxide | TRUE | TRUE | TRUE | TRUE |
| density | FALSE | FALSE | FALSE | TRUE |
| pH | TRUE | TRUE | TRUE | TRUE |
| sulphates | TRUE | TRUE | TRUE | FALSE |
| alcohol | TRUE | TRUE | TRUE | FALSE |
# Although we get the same number of parameters for the models for red and white wine,
# our models don't agree on which variables to select. This is likely due to the different characteristics
# that contribute to a good red wine vs a good white wine.
r.glm <- glm(quality~ fixed.acidity+volatile.acidity+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,data=rwm)
#Smaller MSE than what was observed for white wine
cat("Red wine: MSE of ",cv.glm(rwm, r.glm, K=10)$delta[1])
## Red wine: MSE of 0.40728
Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.
# To compensate from the lack of predict method for regsubsets
# Taken from ISLR
predict.regsubsets =function (object ,newdata ,id ,...){
form=as.formula(object$call [[2]])
mat=model.matrix(form,newdata)
coefi=coef(object ,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
# number of folds
k=10
#For reproducibility with sample function
set.seed(1)
folds=sample(1:k,nrow(wwm),replace=TRUE)
cv.errors.white=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
# For each fold, we need to loop through the models with upto 11 variables
for(j in 1:k){
best.fit = regsubsets(quality~., data=wwm[folds!=j,], nvmax=11)
for(i in 1:11){
pred=predict(best.fit,wwm[folds==j,],id=i)
#Store the MSE for each model
cv.errors.white[j,i]=mean((wwm$quality[folds==j]-pred)^2)
}
}
mean.cv.errors.white <-apply(cv.errors.white ,2,mean)
par(mfrow=c(1,1))
plot(mean.cv.errors.white ,type="b", main = "10-fold CV feature selection (white wine)")
cat("White wine (k-fold): Lowest MSE on", which.min(mean.cv.errors.white) ,"variables")
## White wine (k-fold): Lowest MSE on 9 variables
#Re run the full regsubsets to ge the list of variables used
reg.best=regsubsets(quality~.,data=wwm , nvmax=11)
# Variables selected for the model with lowest MSE
kf.white <- summary(reg.best)$which[which.min(mean.cv.errors.white),]
########################################################
# Red wine section
#Repeat the above analysis
set.seed(1)
folds=sample(1:k,nrow(rwm),replace=TRUE)
cv.errors.red=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
# For each fold, we need to loop through the models with upto 11 variables
for(j in 1:k){
best.fit = regsubsets(quality~., data=rwm[folds!=j,], nvmax=11)
for(i in 1:11){
pred=predict(best.fit,rwm[folds==j,],id=i)
#Store the MSE for each model
cv.errors.red[j,i]=mean((rwm$quality[folds==j]-pred)^2)
}
}
mean.cv.errors.red <-apply(cv.errors.red ,2,mean)
par(mfrow=c(1,1))
plot(mean.cv.errors.red ,type="b", main = "10-fold CV feature selection (red wine)")
cat("Red wine (k-fold): Lowest MSE on", which.min(mean.cv.errors.red) ,"variables")
## Red wine (k-fold): Lowest MSE on 7 variables
#Re run the full regsubsets to ge the list of variables used
reg.best=regsubsets(quality~.,data=wwm , nvmax=11)
# Variables selected for the model with lowest MSE
kf.red <- summary(reg.best)$which[which.min(mean.cv.errors.red),]
#Comparing Variables selected from both red and white wines
kable(data.frame(kf.red, kf.white ))
| kf.red | kf.white | |
|---|---|---|
| (Intercept) | TRUE | TRUE |
| fixed.acidity | TRUE | TRUE |
| volatile.acidity | TRUE | TRUE |
| citric.acid | FALSE | FALSE |
| residual.sugar | TRUE | TRUE |
| chlorides | FALSE | TRUE |
| free.sulfur.dioxide | TRUE | TRUE |
| total.sulfur.dioxide | FALSE | FALSE |
| density | TRUE | TRUE |
| pH | TRUE | TRUE |
| sulphates | TRUE | TRUE |
| alcohol | FALSE | TRUE |
#Model has larger MSE than what was observed for red wine using regsubsets
kr.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+free.sulfur.dioxide+pH+sulphates+alcohol,data=rwm)
cat("Red wine (k-fold): MSE of ",cv.glm(rwm, kr.glm, K=10)$delta[1])
## Red wine (k-fold): MSE of 0.417774
#Model has larger MSE than what was observed for white wine using regsubsets
kw.glm <- glm(quality~ fixed.acidity+volatile.acidity+residual.sugar+free.sulfur.dioxide+pH+sulphates+alcohol,data=wwm)
cat("White wine(k-fold): MSE of ",cv.glm(wwm, kw.glm, K=10)$delta[1])
## White wine(k-fold): MSE of 0.5532324
#Regsubsets out-performed both models from cross-validation although the MSE was very close.
# For the two wine types, the k-fold cross-validated regsubsets approach has picked the same variables
# which is quite different from what was seen in the normal regsubsets
# We agree on 6 variables across the four models.
# I suspect that the k-fold models are dropping chlorides due to the high correlation we observed with alcohol which is also being included in the model.
kable(data.frame(regsub.red$exhaustive, regsub.white$exhaustive,kf.red , kf.white))
| regsub.red.exhaustive | regsub.white.exhaustive | kf.red | kf.white | |
|---|---|---|---|---|
| (Intercept) | TRUE | TRUE | TRUE | TRUE |
| fixed.acidity | TRUE | TRUE | TRUE | TRUE |
| volatile.acidity | TRUE | TRUE | TRUE | TRUE |
| citric.acid | TRUE | FALSE | FALSE | FALSE |
| residual.sugar | FALSE | TRUE | TRUE | TRUE |
| chlorides | TRUE | TRUE | FALSE | TRUE |
| free.sulfur.dioxide | TRUE | TRUE | TRUE | TRUE |
| total.sulfur.dioxide | TRUE | FALSE | FALSE | FALSE |
| density | FALSE | TRUE | TRUE | TRUE |
| pH | TRUE | TRUE | TRUE | TRUE |
| sulphates | TRUE | TRUE | TRUE | TRUE |
| alcohol | TRUE | TRUE | FALSE | TRUE |
Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.
ridge.lasso.reg <- function(dat, lbl="white wine ridge", a=1){
# put into matrix form
r.x <- model.matrix(quality~.,dat)[,-1]
r.y <- dat[,"quality"]
# Create train and test sets
set.seed (1)
r.train=sample(1:nrow(r.x), nrow(r.x)*.8)
r.test=(-r.train)
r.y.test=r.y[r.test]
#Use cross-validation to find best value of lambda
cv.out=cv.glmnet(r.x[r.train ,],r.y[r.train],alpha=a)
plot(cv.out)
bestlam=cv.out$lambda.min
#Test model on test set
ridge.pred=predict(cv.out,s=bestlam ,newx=r.x[r.test,])
cat(lbl, " Test MSE:",mean((ridge.pred-r.y.test)^2))
#Return coeff for regression model using optimal lambda
grid=10^seq(10,-2,length=100)
out=glmnet(r.x,r.y,alpha=1,lambda=grid)
predict(out,type="coefficients",s=bestlam)[,1]
}
#Ridge regression for white and red wines
ww.ridge = ridge.lasso.reg(wwm, lbl="White Wine (ridge regression)", a=0)
## White Wine (ridge regression) Test MSE: 0.5511096
rw.ridge = ridge.lasso.reg(rwm, lbl="Red Wine (ridge regression)", a=0)
## Red Wine (ridge regression) Test MSE: 0.4045869
#Lasso regressions for white and red wines
ww.lasso = ridge.lasso.reg(wwm, lbl="White Wine (lasso regression)", a=1)
## White Wine (lasso regression) Test MSE: 0.5472413
rw.lasso = ridge.lasso.reg(rwm, lbl="Red Wine (lasso regression)", a=1)
## Red Wine (lasso regression) Test MSE: 0.4041392
kable(data.frame(reg.red=regsub.red$exhaustive, reg.white=regsub.white$exhaustive,kf.red , kf.white, las.white= ww.lasso>0, las.red=rw.lasso>0))
| reg.red | reg.white | kf.red | kf.white | las.white | las.red | |
|---|---|---|---|---|---|---|
| (Intercept) | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| fixed.acidity | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE |
| volatile.acidity | TRUE | TRUE | TRUE | TRUE | FALSE | FALSE |
| citric.acid | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| residual.sugar | FALSE | TRUE | TRUE | TRUE | TRUE | FALSE |
| chlorides | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE |
| free.sulfur.dioxide | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| total.sulfur.dioxide | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE |
| density | FALSE | TRUE | TRUE | TRUE | FALSE | FALSE |
| pH | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE |
| sulphates | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE |
| alcohol | TRUE | TRUE | FALSE | TRUE | TRUE | TRUE |
#Not displaying the parameters from ridge regression as it contains all variables
#Both ridge and lasso had very similar Test MSEs to each other, however, in general, they had lower test MSEs than the other methods looked at (esp for red wine), additionally, lasso regression only selected 6 variables. However, these differences could be due to only using .8 % of the data as the training set here.
Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.
#Make copies of the datasets
white.pc <- wwm
red.pc <- rwm
#Add a source column
white.pc$source <- "white"
red.pc$source <- "red"
#Join datasets
all.pc <- rbind(white.pc, red.pc)
names(all.pc)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "source"
# Remove Quality and Source
pca <- all.pc[,-c(12:13)]
names(pca)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol"
# Calculate Principal components with Scaling
pca.scale <- prcomp(pca,scale=TRUE)
# variance explained by each principal components
plot(pca.scale)
# Scree plot for the proportion of variance explained by the principal components
pr.var <- pca.scale$sdev^2
prop.var <- pr.var/sum(pr.var)
plot(prop.var, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
#We can see that the first 4 principal components are needed to explain most of the variance
# Biplot of first 2 principal components
biplot(pca.scale, cex=.5)
# We can clearly see 2 discinct clusters, we will first group them by color using the Cols function from ISLR
# We can also see that total.sulfur.dioxide and free.sulfur.dioxide are highly correlated, as are residual.sugar and citric acid
# White wines can be characterised more by, total.sulfur.dioxide and residual.sugar. Whereas reds are characterised more by volatile acid and sulphates.
Cols=function(vec){
cols=rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
# The first 2 principal components dont do a good job of grouping based on quality
#this isnt a big surprise as the linear models didnt do a great job of predicting the quality either.
plot(pca.scale$x[,1:2], col=Cols(all.pc$quality), pch=19,
xlab="Z1",ylab="Z2")
# The first two principal components do a great job of identifying wine types
# it shows that there is a lot of variance in the predictors based on wine type
# This is in line with what was observed earlier, when using regsubsets to estimate important
# predictor variables based on wine type.
plot(pca.scale$x[,1:2], col=Cols(all.pc$source), pch=19,
xlab="Z1",ylab="Z2")
Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.
l.rwm = rwm[,-12]
head(l.rwm)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 2.001480 -0.3566749 0.00 1.9 -2.577022
## 2 2.054124 -0.1278334 0.00 2.6 -2.322788
## 3 2.054124 -0.2744368 0.04 2.3 -2.385967
## 4 2.415914 -1.2729657 0.56 1.9 -2.590267
## 5 2.001480 -0.3566749 0.00 1.9 -2.577022
## 6 2.001480 -0.4155154 0.00 1.8 -2.590267
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 2.397895 3.526361 0.9978 3.51 -0.5798185 9.4
## 2 3.218876 4.204693 0.9968 3.20 -0.3856625 9.8
## 3 2.708050 3.988984 0.9970 3.26 -0.4307829 9.8
## 4 2.833213 4.094345 0.9980 3.16 -0.5447272 9.8
## 5 2.397895 3.526361 0.9978 3.51 -0.5798185 9.4
## 6 2.564949 3.688879 0.9978 3.51 -0.5798185 9.4
pc <- prcomp(l.rwm)
#From earlier, we found that we needed the first 4 principal components to explain most of the variance
betas <- pc$x %*% t(pc$rotation)
p=list()
for (j in 1:nrow(l.rwm)){
for (i in 1:11){
betas[j,i] = betas[j,i] * l.rwm[j,i]
}
p[j]=sum(betas[j,])
}